Data Brew: Brewing Success with Starbucks Customer Data
TEAM 6 - Chekitha Swayampu, Hrushikesh Sai Seshagiri Chowdary Uppalapati, Swathi Murali Srinivasan, Vaishnavi Tamilvanan
ABSTRACT
With the main goals of this extensive project being to maximize the effectiveness of promotional activities, improve overall satisfaction, and increase customer engagement, we conducted a thorough analysis of Starbucks’ marketing strategies. Our analysis involved a deep dive into Starbucks’ customer data, using a variety of visualizations and statistical methods to uncover insights.
Our initiative’s primary goal was to increase customer satisfaction and engagement. Our goal was to determine the best channels for promotions and tailor our products to each individual customer’s needs. We used a variety of sophisticated modeling methods, such as k-means clustering, logistic regression, decision trees, and support vector machines (SVM), to accomplish this. These models were crucial in helping us understand consumer preferences and behavior. Our strategy is expected to improve the efficacy and efficiency of Starbucks’ to improve the efficiency of guaranteeing the company’s sustained customer satisfaction and market dominance. It is fully documented in a R Markdown file in our GitHub repository.
INTRODUCTION
Our project examines the interactions and transactions of 17,000 customers using the Starbucks Customer Dataset. We use K-Means clustering for customer segmentation using data from responses (transcript.csv), customer demographics (profile.csv), and offers (portfolio.csv). With the use of this method, strategies can be more precisely targeted by identifying unique behavioral and demographic patterns. Through the analysis of purchase behaviors and offer responsiveness, our goal is to improve customer engagement and promotional efficacy at Starbucks.
SUMMARY OF DATASET
The data is contained in three files: portfolio.csv - data about offers sent to customers (10 offers x 6 columns) profile.csv - demographic data of customers (17,000 customers x 5 columns) transcript.csv - customer response to offers and transactions made (306,648 events x 4 columns)
DATASET OVERVIEW
# Load data
portfolio <- read.csv("data/portfolio.csv", row.names = 1)
profile <- read.csv("data/profile.csv", row.names = 1)
transcript <- read.csv("data/transcript.csv", row.names = 1)BEFORE CLEANING
PORTFOLIO
| reward | channels | difficulty | duration | offer_type | id | |
|---|---|---|---|---|---|---|
| 0 | 10 | [‘email’, ‘mobile’, ‘social’] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [‘web’, ‘email’, ‘mobile’, ‘social’] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [‘web’, ‘email’, ‘mobile’] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [‘web’, ‘email’, ‘mobile’] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [‘web’, ‘email’] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 5 | 3 | [‘web’, ‘email’, ‘mobile’, ‘social’] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 |
PROFILE
| gender | age | id | became_member_on | income | |
|---|---|---|---|---|---|
| 0 | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NA | |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000 |
| 2 | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NA | |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000 |
| 4 | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NA | |
| 5 | M | 68 | e2127556f4f64592b11af22de27a7932 | 20180426 | 70000 |
TRANSCRIPT
| person | event | value | time | |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {‘offer id’: ‘9b98b8c7a33c4b65b9aebfe6a799e6d9’} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {‘offer id’: ‘0b1e1539f2cc45b7b9fa7c272da2e1d7’} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {‘offer id’: ‘2906b810c7d4411798c6938adc9daaa5’} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {‘offer id’: ‘fafdcd668e3743c1bb461111dcafc2a4’} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {‘offer id’: ‘4d5c57ea9a6940dd891ad53e9dbe8da0’} | 0 |
| 5 | 389bc3fa690240e798340f5a15918d5c | offer received | {‘offer id’: ‘f19421c1d4aa40978ebb69ca19b0e20d’} | 0 |
DATA CLEANING
# Expand "channels" into binary columns of all different channels in the dataset (email, web, mobile, social)
library(dplyr)
library(stringr)
# Create binary columns for each channel
channels_list <- c('email', 'web', 'mobile', 'social')
portfolio_channels <- portfolio %>%
mutate(email = as.numeric(str_detect(channels, 'email')),
web = as.numeric(str_detect(channels, 'web')),
mobile = as.numeric(str_detect(channels, 'mobile')),
social = as.numeric(str_detect(channels, 'social')))
# Create binary columns for each offer type
portfolio_offertype <- portfolio %>%
mutate(bogo = as.numeric(offer_type == 'bogo'),
informational = as.numeric(offer_type == 'informational'),
discount = as.numeric(offer_type == 'discount'))
merged_portfolio <- merge(portfolio, portfolio_channels, by = "id", all.x = TRUE) %>%
merge(portfolio_offertype, by = "id", all.x = TRUE)
new_portfolio <- merged_portfolio %>%
select(reward, difficulty, duration, offer_type, id, bogo, discount, informational, email, mobile, social, web)
unique_ids <- unique(new_portfolio$id)
id_mapping <- setNames(seq_along(unique_ids), unique_ids)
new_portfolio$id <- id_mapping[new_portfolio$id]
# Checking for null values in each column
col_sums_null <- colSums(is.na(new_portfolio))
duplicated_rows <- new_portfolio[duplicated(new_portfolio), ]PROFILE
unique_ids <- unique(profile$id)
id_mapping <- setNames(seq_along(unique_ids), unique_ids)
profile$id <- id_mapping[profile$id]
na_counts <- colSums(is.na(profile))
cat("ORIGINAL NA VALUES:", na_counts)## ORIGINAL NA VALUES: 0 0 0 0 2175
# Remove rows with NA values and Age equal to 118
profile_new <- subset(profile, !is.na(age) & age != 118)
na_counts_after_cleaning <- colSums(is.na(profile_new))
cat("Count of NA values afer removing:",na_counts_after_cleaning)## Count of NA values afer removing: 0 0 0 0 0
TRANSCRIPT
library(dplyr)
# Extract offer_id from the 'value' column
transcript <- transcript %>%
mutate(offer_id = ifelse(grepl("'offer id'", value),
gsub(".*'offer id':\\s*'([[:alnum:]]+)'.*", "\\1", value),
ifelse(grepl("'offer_id'", value),
gsub(".*'offer_id':\\s*'([[:alnum:]]+)'.*", "\\1", value),
NA)))
# Create amount column
transcript <- transcript %>%
mutate(amount = ifelse(!is.na(str_extract(value, '"amount": ([0-9.]+)')),
as.numeric(str_extract(value, '"amount": ([0-9.]+)')), 0))
# Create reward_given column
transcript <- transcript %>%
mutate(reward_given = ifelse(!is.na(str_extract(value, '"reward": ([0-9]+)')),
as.numeric(str_extract(value, '"reward": ([0-9]+)')), 0))
# Remove value column
transcript <- select(transcript, -value)
if (!requireNamespace("digest", quietly = TRUE)) {
install.packages("digest")
}
library(digest)
# Function to convert person value to an integer
map_person_to_int <- function(person_value) {
# Calculate the hash value using SHA-256
hash_value <- digest(person_value, algo = "sha256", serialize = FALSE)
# Convert the hash value to a numeric representation
person_integer <- sum(as.integer(charToRaw(hash_value)))
return(person_integer)
}
transcript$person <- sapply(transcript$person, map_person_to_int)
# Create a mapping dictionary
unique_ids <- unique(transcript$offer_id)
id_mapping <- setNames(seq_along(unique_ids), unique_ids)
transcript$offer_id <- id_mapping[transcript$offer_id]
library(dplyr)
# Create binary columns for each event
event_list <- c('offer completed', 'offer received', 'offer viewed', 'transaction')
transcript_new <- transcript %>%
mutate(offer_completed = as.numeric(event == 'offer completed'),
offer_received = as.numeric(event == 'offer received'),
offer_viewed = as.numeric(event == 'offer viewed'),
transaction = as.numeric(event == 'transaction'))
# Display the updated data frame
head(transcript_new)| person | event | time | offer_id | amount | reward_given | offer_completed | offer_received | offer_viewed | transaction | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4182 | offer received | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 4547 | offer received | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 4272 | offer received | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4550 | offer received | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 4161 | offer received | 0 | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 4438 | offer received | 0 | 6 | 0 | 0 | 0 | 1 | 0 | 0 |
PORTFOLIO
library(dplyr)
# Rename columns in the 'portfolio' data frame
new_portfolio <- new_portfolio %>%
rename(offer_id = id, offer_reward = reward)
transcript1 <- read.csv("data/transcript.csv", row.names =1)
profile1 <- read.csv("data/profile.csv", row.names = 1)
profile1$gender <- as.factor(profile1$gender)
# Impute missing values in 'income' with mean
profile1$income[is.na(profile1$income)] <- mean(profile1$income, na.rm = TRUE)
# Remove any non-numerc characters
profile1$became_member_on <- gsub("[^0-9]", "", profile1$became_member_on)
profile1$became_member_on <- as.Date(profile1$became_member_on, format = "%Y%m%d")
profile1$membership_duration <- as.numeric(difftime(Sys.Date(), profile1$became_member_on, units = "days"))
profile1$membership_duration <- as.numeric(difftime(Sys.Date(), profile1$became_member_on, units = "days"))
library(jsonlite)
# Extract offer id from 'value' column
transcript1$value <- gsub("'", "\"", transcript1$value) # Replace single quotes with double quotes
transcript1$offer_id <- sapply(transcript1$value, function(x) {
parsed_value <- fromJSON(x, simplifyVector = TRUE)
if (!is.null(parsed_value) && 'offer id' %in% names(parsed_value)) {
return(parsed_value[['offer id']])
} else {
return(NA)
}
})
# Create binary columns for different events
transcript1$offer_received <- as.integer(transcript1$event == "offer received")
transcript1$offer_viewed <- as.integer(transcript1$event == "offer viewed")
transcript1$offer_completed <- as.integer(transcript1$event == "offer completed")
transcript1$transaction <- as.integer(transcript1$event == "transaction")
# Rename columns in the 'transcript' data frame
transcript_new <- transcript_new %>%
rename(user_id = person)
# Rename columns in the 'profile' data frame
profile_new <- profile_new %>%
rename(user_id = id)
# Left join on 'offer_id'
full_df <- left_join(transcript_new, new_portfolio, by = 'offer_id')
# Inner join on 'user_id'
full_df <- inner_join(full_df, profile_new, by = 'user_id')
head(new_portfolio)| offer_reward | difficulty | duration | offer_type | offer_id | bogo | discount | informational | mobile | social | web | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 20 | 10 | discount | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 3 | 7 | 7 | discount | 2 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 2 | 10 | 7 | discount | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 4 | informational | 4 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
| 10 | 10 | 5 | bogo | 5 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 0 | 0 | 3 | informational | 6 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |
MERGED DATA
# Merge profile and transcript datasets based on 'id' (customer ID)
merged_data <- merge(profile1, transcript1, by.x = "id", by.y = "person", all.x = TRUE)
merged_data <- merge(merged_data, portfolio, by.x = "offer_id", by.y = "id", all.x = TRUE)
head(merged_data)| offer_id | id | gender | age | became_member_on | income | membership_duration | event | value | time | offer_received | offer_viewed | offer_completed | transaction | reward | channels | difficulty | duration | offer_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 1314cb7b712649af9acf1ac19aadd267 | 118 | 2017-11-10 | 65405 | 2224 | offer received | {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} | 504 | 1 | 0 | 0 | 0 | 5 | [‘web’, ‘email’] | 20 | 10 | discount | |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 2b91e3c00a424185ab261b2a204c9718 | M | 58 | 2017-09-16 | 53000 | 2279 | offer received | {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} | 168 | 1 | 0 | 0 | 0 | 5 | [‘web’, ‘email’] | 20 | 10 | discount |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | dffe25b1294f4166a72f73f57926214e | F | 61 | 2017-10-31 | 77000 | 2234 | offer viewed | {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} | 348 | 0 | 1 | 0 | 0 | 5 | [‘web’, ‘email’] | 20 | 10 | discount |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | bd920b7b033d42fb9d97be92343a9aef | F | 52 | 2017-01-12 | 97000 | 2526 | offer received | {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} | 0 | 1 | 0 | 0 | 0 | 5 | [‘web’, ‘email’] | 20 | 10 | discount |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 9504588a36ed4c868759a7092cdfa7b1 | 118 | 2018-02-07 | 65405 | 2135 | offer viewed | {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} | 360 | 0 | 1 | 0 | 0 | 5 | [‘web’, ‘email’] | 20 | 10 | discount | |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | cc3ad71378a240a0841158e9aefe046e | M | 25 | 2017-02-20 | 35000 | 2487 | offer received | {“offer id”: “0b1e1539f2cc45b7b9fa7c272da2e1d7”} | 408 | 1 | 0 | 0 | 0 | 5 | [‘web’, ‘email’] | 20 | 10 | discount |
EXPLORATORY DATA ANALYSIS
Age and Gender
# Create a side-by-side boxplot and histogram
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))
boxplot(profile_new$age, xlab = "Age", main = "Boxplot", col = "lightblue")
hist(profile_new$age, xlab = "Age", main = "Histogram", col = "lightblue")# Adjust axis label sizes
par(cex.lab = 1.5)
# Print descriptive statistics
summary(profile_new$age)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 42.0 55.0 54.4 66.0 101.0
age_sd <- sd(profile_new$age)
# Print the standard deviation
cat("Standard Deviation of Age:", age_sd, "\n")## Standard Deviation of Age: 17.4
The customer age range spans from 18 years as the youngest to 101 years as the oldest.
The distribution of customer ages appears to approximate a normal distribution, with a mean and standard deviation of approximately 54 and 17, respectively.
Gender distribution.
library(ggplot2)
library(plotly)
# Create a data frame with the count of each gender category
gender_counts <- table(profile$gender)
# Calculate percentages
gender_percentages <- round((gender_counts / sum(gender_counts)) * 100, 1)
# Define custom colors
custom_colors <- c("#FF6F61", "#6B5B95", "#88B04B") # You can use any color codes you like
# Create the 3D pie chart
pie_chart <- plot_ly(
labels = names(gender_counts),
values = gender_counts,
type = "pie",
textinfo = "label+percent",
marker = list(colors = custom_colors),
pull = c(0.1, 0.1, 0.2) # Adjust pull for exploding wedges
) %>%
layout(
title = "Gender Distribution",
scene = list(
aspectmode = "cube", # Center the chart
camera = list(eye = list(x = 1.25, y = 1.25, z = 0.85)) # 3D view settings
),
showlegend = FALSE
)
# Display the 3D pie chart
pie_chart- The customer base consists of a larger proportion of males (57.2%) compared to females (41.3%), with a minor representation (1.4%) from customers identifying with other genders.
Data distribution among different events.
library(ezids)
# Get value counts for the 'event' column
event_value_counts <- table(transcript_new$event)
# EDA on event occurences
# Load the required libraries
library(ggplot2)
library(plotly)
# Create a data frame from the event counts
event_counts_df <- data.frame(event = names(event_value_counts), count = as.numeric(event_value_counts))
# Define a vector of colors for four events
event_colors <- c("Coral", "Cyan", "Magenta", "Turquoise")
# Create a ggplot2 bar chart
p <- ggplot(event_counts_df, aes(x = event, y = count, fill = event)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = event_colors) +
labs(title = "Event Distribution", x = "Event", y = "Count") +
theme_minimal() +
scale_y_continuous(breaks = seq(0, max(event_counts_df$count), by = 20000)) +
coord_flip() +
theme(panel.border = element_rect(color = "black", fill = NA),
panel.grid = element_blank(),
axis.text.x = element_text(face = "bold", color = "black"),
axis.text.y = element_text(face = "bold", color = "black"))
# Make the ggplot2 chart interactive using plotly
interactive_plot <- ggplotly(p)
# Display the interactive plot
interactive_plotAs expected, not all recieved offers were viewed and not all recieved offers were completed. The dataset contains 45% transaction events and 55% offer events.
Percentages of each offer type sent.
colors <- c("#ad6a6c", "#d0ada7", "#e8d6cb")
# Count the frequency of each offer type
offer_counts <- table(new_portfolio$offer_type)
ggplot(data = as.data.frame(offer_counts), aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill = colors, color = "black") +
labs(title = "Frequency of Each Offer Type",
x = "Offer Type",
y = "Count") +
scale_fill_manual(values = colors)From the above observation BOGO and the discount offer type has the maximum count of nearly 66,000 and 62,000 respectively while the informational offer had the least count.
# Load the ggplot2 library
library(ggplot2)
# Create a scatterplot
ggplot(profile_new, aes(x = gender, y = income, color = gender)) +
geom_jitter(width = 0.3, alpha = 0.7, size = 1) +
labs(title = "Scatterplot of Income vs Gender", x = "Gender", y = "Income") +
scale_color_manual(values = c("#E41A1C", "#377EB8", "#4DAF4A")) + # Custom colors
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
panel.grid.major.y = element_line(color = "gray90"),
axis.text.x = element_text(angle = 45, hjust = 1))Male customers have a right-skewed income distribution, indicating that a larger proportion of male customers falls within the lower half of the income spectrum among the company’s customer base.
Male customers have a right-skewed income distribution, indicating that a larger proportion of male customers falls within the lower half of the income spectrum among the company’s customer base.
Female customers have a significantly higher average income compared to customers of other genders. This disparity in income may be attributed to the assumption that female customers, on average, are older than customers in other gender groups.
On average, female customers have an income of $71,000, while male customers have an average income of $61,000. Customers of other genders have an average income of $63,000.
Income vs Gender vs Age
# Load necessary libraries
library(plotly)
library(dplyr)
# Create a 3D scatterplot with Plotly
scatter_3d <- profile_new %>%
plot_ly(x = ~age, y = ~income, z = ~gender, color = ~gender, colors = c("#E41A1C", "#377EB8", "#4DAF4A"),
type = "scatter3d", mode = "markers",
marker = list(size = 2)) %>%
layout(scene = list(xaxis = list(title = "Age"),
yaxis = list(title = "Income"),
zaxis = list(title = "Gender")))
# Show the 3D scatterplot
scatter_3dIncome tends to increase with age for both men and women. This is evident from the fact that the scatterplot shows a general upward trend from left to right.
Another interesting thing we can observe is that female customers with higher incomes, on average, are older than customers in other gender groups.
The customer base consists of a larger proportion of males (57.2%) compared to females (41.3%), with a minor representation (1.4%) from customers identifying with other genders. The customer age range spans from 18 years as the youngest to 101 years as the oldest. The income of customers spans from 30k to 120k, with an average income of 65.4k. The income distribution closely mirrors that of the general population. There is a logical relationship between age and cafe visits, as individuals between 46 and 75 years tend to have more available time to visit cafes. The income distribution varies between genders, with females having higher average incomes
DATA MODELLING
SMART Q1: How can we design a precise predictive model to classify customer responses to offers as successful or not?
Now that we have analyzed the dataset, we will proceed by creating a model that would predict whether a user will respond to an offer or not. There are 4 scenarios that can happen:
A user will view and complete the offer. A user will just view the offer. A user will not view the offer, but will complete it anyway (without prior knowledge of the offer existence) A user will not view the offer and will not complete it.
Since starbucks are targetting users that will view the offer and complete it afterwards, our prediction would be a binary value as such: 1: User will view and complete the offer 0: Otherwise
In order to proceed with the prediction, we will need to create a new dataframe that will include the targeted features and the prediction column. The features that will be analyzed are:
Age Income Gender Offer_type Reward Duration Difficulty Channels
A new column will be created “offer_success” to show wehther a user will successfully view and complete the offer.
From the datasets, Offer types BOGO and discount have a clear criteria for completion and can be founded by looking at the event column with value “completed offer” and then double check that the timing of completion and viewing and offer expiration are consistent.
However, dealing with informational offers is different. Since informational offers are advertisement offers that don’t have a completion criteria, we will need to define how to consider them successful.
One way would be to look at all transactions and check if a transaction has occurred during an informational offer period. These transactions are considered to be influenced by the offer and thus the informational offer was successful. This is, of course, under the condition that a user has received and viewed the informational offer, THEN proceed to make a transaction.
In order to do that, create one dataset that includes all pairs (user_id and offer_id) of completed offers for Bogo and discount. Also create another dataset that includes all pairs (user_id and offer_id) of completed offers for type informational.
Next, will merge them together into a bigger dataset including all successfully completed offers.
library(dplyr)
# Filter offer_received_df
offer_received_df <- transcript_new %>%
filter(offer_received == 1) %>%
select(offer_id, user_id, time) %>%
rename(time_received = time)
# Filter offer_viewed_df
offer_viewed_df <- transcript_new %>%
filter(offer_viewed == 1) %>%
select(offer_id, user_id, time)
# Filter offer_completed_df
offer_completed_df <- transcript_new %>%
filter(offer_completed == 1) %>%
select(offer_id, user_id, time)
# Merge offer_completed_df and offer_viewed_df
complete_bogo_discount_df <- merge(offer_completed_df, offer_viewed_df, by = c('offer_id', 'user_id'))
# Merge with offer_received_df
complete_bogo_discount_df <- merge(complete_bogo_discount_df, offer_received_df, by = c('offer_id', 'user_id'))
complete_bogo_discount_df <- complete_bogo_discount_df %>%
rename(time_completed = time.x, time_viewed = time.y)
complete_bogo_discount_df <- merge(complete_bogo_discount_df, new_portfolio, by = 'offer_id')
# Display the resulting data frame
head(complete_bogo_discount_df)| offer_id | user_id | time_completed | time_viewed | time_received | offer_reward | difficulty | duration | offer_type | bogo | discount | informational | mobile | social | web | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3841 | 624 | 588 | 576 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3841 | 624 | 588 | 336 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3883 | 504 | 576 | 504 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3925 | 60 | 18 | 0 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3925 | 60 | 18 | 168 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3925 | 60 | 228 | 0 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
library(dplyr)
complete_bogo_discount_df <- complete_bogo_discount_df %>%
mutate(time_expire = time_received + duration * 24)
# Display the resulting data frame
head(complete_bogo_discount_df)| offer_id | user_id | time_completed | time_viewed | time_received | offer_reward | difficulty | duration | offer_type | bogo | discount | informational | mobile | social | web | time_expire | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3841 | 624 | 588 | 576 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 816 |
| 1 | 3841 | 624 | 588 | 336 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 576 |
| 1 | 3883 | 504 | 576 | 504 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 744 |
| 1 | 3925 | 60 | 18 | 0 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 240 |
| 1 | 3925 | 60 | 18 | 168 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 408 |
| 1 | 3925 | 60 | 228 | 0 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 240 |
library(dplyr)
complete_bogo_discount_df <- complete_bogo_discount_df %>%
filter(
time_received <= time_viewed,
time_viewed <= time_completed,
time_completed <= time_expire
)
# Display the resulting data frame
head(complete_bogo_discount_df)| offer_id | user_id | time_completed | time_viewed | time_received | offer_reward | difficulty | duration | offer_type | bogo | discount | informational | mobile | social | web | time_expire | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3841 | 624 | 588 | 576 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 816 |
| 1 | 3925 | 60 | 18 | 0 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 240 |
| 1 | 3926 | 42 | 24 | 0 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 240 |
| 1 | 3984 | 462 | 438 | 408 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 648 |
| 1 | 3995 | 240 | 174 | 0 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 240 |
| 1 | 3995 | 240 | 174 | 168 | 5 | 20 | 10 | discount | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 408 |
We have created the first dataset, we will repeat similar logic to create second dataset.
library(dplyr)
# Dataframe holding the events where a transaction took place
transaction_df <- transcript_new %>%
filter(transaction == 1) %>%
select(user_id, time, amount)
# Merge transaction and offer_viewed dataframes
complete_info_df <- left_join(transaction_df, offer_viewed_df, by = 'user_id')
# Merge with offer_received_df
complete_info_df <- left_join(complete_info_df, offer_received_df, by = c('offer_id', 'user_id'))
complete_info_df <- complete_info_df %>%
rename(time_transaction = time.x, time_viewed = time.y)
# Merge with new_portfolio
complete_info_df <- left_join(complete_info_df, new_portfolio, by = 'offer_id')
# Calculate offer expiration
complete_info_df <- complete_info_df %>%
mutate(time_expire = time_received + duration * 24)
# Choose only informational offer
complete_info_df <- filter(complete_info_df, informational == 1)
#filter based on the mentioned criteria
complete_info_df <- complete_info_df %>%
filter(
time_viewed >= time_received,
time_transaction <= time_expire,
time_viewed <= time_transaction
)
complete_bogo_discount_df <- mutate(complete_bogo_discount_df, offer_success = 1)
complete_info_df <- mutate(complete_info_df, offer_success = 1)
complete_bogo_discount_df <- complete_bogo_discount_df %>%
select(user_id, offer_id, offer_success)
complete_info_df <- complete_info_df %>%
select(user_id, offer_id, offer_success)
head(complete_info_df)| user_id | offer_id | offer_success |
|---|---|---|
| 4665 | 4 | 1 |
| 4665 | 4 | 1 |
| 4598 | 6 | 1 |
| 4598 | 6 | 1 |
| 4598 | 6 | 1 |
| 4400 | 6 | 1 |
| user_id | offer_id | offer_success |
|---|---|---|
| 3841 | 1 | 1 |
| 3925 | 1 | 1 |
| 3926 | 1 | 1 |
| 3984 | 1 | 1 |
| 3995 | 1 | 1 |
| 3995 | 1 | 1 |
library(dplyr)
df1 <- offer_received_df %>%
select(offer_id, user_id) %>%
distinct() # Ensure unique pairs
df2 <- concat_r %>%
select(offer_id, user_id, offer_success)
all_clean_df <- left_join(df1, df2, by = c('offer_id', 'user_id'))
# Consider the other offers as unsuccessful
all_clean_df$offer_success <- ifelse(is.na(all_clean_df$offer_success), 0, all_clean_df$offer_success)
all_clean_df <- data.frame(all_clean_df, row.names = NULL)
# Subset Columns
model_df <- all_clean_df %>%
select(user_id, offer_id, offer_success)
# Merge with 'portfolio' DataFrame
model_df <- left_join(model_df, new_portfolio, by = 'offer_id')
# Merge with 'profile' DataFrame
model_df <- left_join(model_df, profile_new, by = 'user_id')
# Drop Columns
model_df <- model_df %>%
select(-became_member_on, -offer_type)
head(model_df)| user_id | offer_id | offer_success | offer_reward | difficulty | duration | bogo | discount | informational | mobile | social | web | gender | age | income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | M | 58 | 61000 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | M | 58 | 61000 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | M | 58 | 61000 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | M | 58 | 61000 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | M | 58 | 61000 |
| 4547 | 2 | 1 | 3 | 7 | 7 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | F | 30 | 31000 |
normalize_data <- function(df, column) {
# Min-Max Normalization
df[[column]] <- (df[[column]] - min(df[[column]])) / (max(df[[column]]) - min(df[[column]]))
# Return the modified data frame
return(df)
}
# Create Integer Mapping for 'gender' Column
gender_levels <- unique(model_df$gender)
gender_map <- setNames(as.integer(seq_along(gender_levels)), gender_levels)
model_df$gender <- gender_map[model_df$gender]
# Normalize 'age' and 'income' Columns
normalize_data <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
model_df$age <- normalize_data(model_df$age)
model_df$income <- normalize_data(model_df$income)Drop specified columns
columns_to_drop <- c('gender', 'age', 'income')
model_df <- model_df[, !(names(model_df) %in% columns_to_drop)]
head(model_df)| user_id | offer_id | offer_success | offer_reward | difficulty | duration | bogo | discount | informational | mobile | social | web | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4182 | 1 | 1 | 5 | 20 | 10 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4547 | 2 | 1 | 3 | 7 | 7 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
Three classifier algorithms will be used
1.LinearSVC 2.Decision Tree Classifier 3.k-nearest neighbors
To evaluate a model, we will look into f1-score. The F1 score can be interpreted as a weighted average of the precision and recall which conveys the balance between them.
Looking at Precision value alone would ignore the False Negatives and would make us miss valuable customers that can potentially complete an offer.
Similarity, looking at Recall value alone would ignore False Positives which can make us send offers to everyone and flood users with offers they are not interested in.
For that, F1 score is the best choice in this case as it provides the balance between them.
# Load libraries
library(dplyr)
library(glmnet)
library(caret)
# Select relevant features based on feature importance
selected_features <- c("offer_reward", "difficulty", "duration", "bogo", "discount", "email", "mobile", "social", "web")
formula <- as.formula(paste("offer_success ~", paste(selected_features, collapse = " + ")))
# Split the data into training and testing sets
set.seed(123) # Set seed for reproducibility
sample_index <- sample(nrow(model_df), 0.8 * nrow(model_df))
train_data <- model_df[sample_index, ]
test_data <- model_df[-sample_index, ]
formula <- as.formula("offer_success ~ offer_reward + difficulty + duration + bogo + discount + informational + email + mobile + social + web")
# Train the logistic regression model
model_lm_old <- glm(formula, data = train_data, family = "binomial")
# Make predictions on the test set
predictions <- predict(model_lm_old, newdata = test_data, type = "response")
# Set a threshold to handle class imbalance
threshold <- 0.5
binary_predictions <- ifelse(predictions > threshold, 1, 0)
# Evaluate the model
conf_matrix <- table(binary_predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
conf_matrix_caret <- confusionMatrix(data = as.factor(binary_predictions), reference = as.factor(test_data$offer_success))
print(conf_matrix)##
## binary_predictions 0 1
## 0 345 0
## 1 226 315211
## [1] "Accuracy: 0.9993"
## [1] "\n***Logistic regression***\n"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 345 0
## 1 226 315211
##
## Accuracy : 0.999
## 95% CI : (0.999, 0.999)
## No Information Rate : 0.998
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.753
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60420
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.99928
## Prevalence : 0.00181
## Detection Rate : 0.00109
## Detection Prevalence : 0.00109
## Balanced Accuracy : 0.80210
##
## 'Positive' Class : 0
##
OBSERVATION:
In this logistic regression model, we aim to predict
offer_success based on features such as
offer_reward, difficulty,
duration, and communication channels. Despite achieving a
high accuracy of 99.93%, the model’s effectiveness is questionable due
to several concerns:
Class Imbalance: The dataset shows a significant imbalance between classes, which could lead to a bias towards predicting the majority class, thereby inflating accuracy.
Sensitivity Concerns: The low sensitivity (60.42%) indicates the model’s limited ability in correctly identifying true positives for the minority class.
Potential Overfitting: The high accuracy might be indicative of overfitting, especially if the test set does not represent the overall population diversity.
Metric Reliability: Sole reliance on accuracy is misleading in imbalanced datasets; alternative metrics like F1-score or AUC-ROC are recommended for a more comprehensive evaluation.
In summary, while the model shows high accuracy, its real-world applicability is limited by issues like class imbalance, potential overfitting, and an over-reliance on accuracy as the primary performance metric. Addressing these concerns is crucial for developing a more robust and reliable predictive model.
# install.packages("ROSE")
library(ROSE)
library(dplyr)
library(glmnet)
library(caret)
selected_features <- c("offer_reward", "difficulty", "duration", "bogo", "discount", "informational", "mobile", "social", "web")
formula <- as.formula(paste("offer_success ~", paste(selected_features, collapse = " + ")))
set.seed(123) # Set seed for reproducibility
sample_index <- sample(nrow(model_df), 0.8 * nrow(model_df))
train_data <- model_df[sample_index, ]
test_data <- model_df[-sample_index, ]
# Address Class Imbalance: Use ROSE for oversampling the minority class
oversampled_train_data <- ROSE(formula, data = train_data, p = 0.5, seed = 123)$data
formula <- as.formula(paste("offer_success ~", paste(selected_features, collapse = " + ")))
# Train the logistic regression model with regularization (elastic net)
model_lm_new <- cv.glmnet(x = as.matrix(oversampled_train_data[, selected_features, drop = FALSE]), # Ensure matrix format
y = oversampled_train_data$offer_success,
alpha = 0.5, # Adjust alpha for desired elastic net mixing
family = "binomial",
standard.error = TRUE)
# Make predictions on the test set
predictions <- predict(model_lm_new, newx = as.matrix(test_data[, selected_features, drop = FALSE]), s = "lambda.min", type = "response")
# Set a threshold to handle class imbalance
threshold <- 0.5
binary_predictions <- ifelse(predictions > threshold, 1, 0)
# Evaluate the model
conf_matrix <- table(binary_predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
conf_matrix_caret <- confusionMatrix(data = as.factor(binary_predictions), reference = as.factor(test_data$offer_success))
print(conf_matrix)##
## binary_predictions 0 1
## 0 542 37998
## 1 29 277213
## [1] "Accuracy: 0.8796"
## [1] "\n***Logistic regression with regularization***\n"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 542 37998
## 1 29 277213
##
## Accuracy : 0.88
## 95% CI : (0.878, 0.881)
## No Information Rate : 0.998
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.024
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.94921
## Specificity : 0.87945
## Pos Pred Value : 0.01406
## Neg Pred Value : 0.99990
## Prevalence : 0.00181
## Detection Rate : 0.00172
## Detection Prevalence : 0.12205
## Balanced Accuracy : 0.91433
##
## 'Positive' Class : 0
##
## Length Class Mode
## lambda 80 -none- numeric
## cvm 80 -none- numeric
## cvsd 80 -none- numeric
## cvup 80 -none- numeric
## cvlo 80 -none- numeric
## nzero 80 -none- numeric
## call 6 -none- call
## name 1 -none- character
## glmnet.fit 13 lognet list
## lambda.min 1 -none- numeric
## lambda.1se 1 -none- numeric
## index 2 -none- numeric
OBSERVATION:
In this improved logistic regression model, we addressed previous limitations through regularization and class imbalance correction. We employed the ROSE package for oversampling the minority class in our training data, leading to a more balanced dataset. Additionally, we utilized the glmnet package to apply elastic net regularization, a combination of L1 and L2 regularization techniques, enhancing the model’s ability to generalize and reducing the risk of overfitting.
The new model’s performance was evaluated using a confusion matrix and accuracy. The accuracy decreased to 87.96%, a more realistic figure considering the previously imbalanced nature of the dataset. Notably, the sensitivity increased to 94.92%, indicating a significantly improved ability to correctly identify true positives for the minority class. The specificity also remained high at 87.95%, confirming the model’s effectiveness in identifying true negatives.
Key improvements in this model include:
Handling Class Imbalance: Oversampling the minority class created a more balanced training dataset, allowing the model to learn more effectively from both classes.
Regularization: The use of elastic net regularization helped in reducing overfitting by penalizing complex models, thus enhancing the model’s generalizability.
Realistic Accuracy: The adjusted accuracy presents a more realistic view of the model’s performance in a balanced scenario.
Improved Sensitivity: The substantial increase in sensitivity demonstrates the model’s enhanced capability in detecting the minority class.
Overall, these adjustments led to a more robust and reliable logistic regression model, better suited for practical applications due to its improved handling of class imbalance and overfitting.
# Plot ROC curve
library(pROC)
roc_curve <- roc(test_data$offer_success, predictions)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
# Calculate and print AUC value
auc_value <- auc(roc_curve)
legend("bottomright", legend = paste("AUC =", round(auc_value, 4)), col = "blue", lwd = 2)OBSERVATION: The ROC-AUC curve displayed represents the performance of the improved logistic regression model. The curve illustrates the trade-off between sensitivity (true positive rate) and specificity (true negative rate) across different thresholds. The AUC (Area Under the Curve) value is a summary measure of the curve; in this model, the AUC is 0.9727.
An AUC of 0.9727 suggests a high degree of separability between the positive and negative classes, indicating that the model is capable of distinguishing between the two with high accuracy. The closer the AUC is to 1, the better the model is at predicting 0s as 0s and 1s as 1s. An AUC value significantly higher than 0.5, which would represent a random guess, indicates that the model has a strong predictive performance.
In summary, the ROC-AUC curve and the high AUC value of 0.9727 reflect the model’s robustness and its improved ability to identify both classes accurately, signifying a substantial advancement from the initial logistic regression approach.
# Load libraries
library(dplyr)
library(rpart)
library(caret)
# Split the data into training and testing sets
set.seed(123) # Set seed for reproducibility
sample_index <- sample(nrow(model_df), 0.8 * nrow(model_df))
train_data <- model_df[sample_index, ]
test_data <- model_df[-sample_index, ]
formula <- as.formula("offer_success ~ user_id + offer_id + offer_reward + difficulty + duration + bogo + discount + informational + email + mobile + social + web")
# Train the Decision Tree model
model_dtree_old <- rpart(formula, data = train_data, method = "class")
# Make predictions on the test set
predictions <- predict(model_dtree_old, newdata = test_data, type = "class")
# Evaluate the model
conf_matrix <- table(predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
# Use caret's confusionMatrix function for a detailed report
conf_matrix_caret <- confusionMatrix(data = as.factor(predictions), reference = as.factor(test_data$offer_success))
print(conf_matrix)##
## predictions 0 1
## 0 400 56
## 1 171 315155
## [1] "Accuracy: 0.9993"
## [1] "\n***Decision Tree***\n"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 400 56
## 1 171 315155
##
## Accuracy : 0.999
## 95% CI : (0.999, 0.999)
## No Information Rate : 0.998
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.779
##
## Mcnemar's Test P-Value : 3.84e-14
##
## Sensitivity : 0.70053
## Specificity : 0.99982
## Pos Pred Value : 0.87719
## Neg Pred Value : 0.99946
## Prevalence : 0.00181
## Detection Rate : 0.00127
## Detection Prevalence : 0.00144
## Balanced Accuracy : 0.85017
##
## 'Positive' Class : 0
##
OBSERVATION:
The Decision Tree model was constructed to predict offer_success using features including user demographics, offer details, and the modes of communication used to send the offer. Despite an apparently high accuracy of 99.93%, similar to the logistic regression model, this measure may not reflect true predictive performance due to several underlying issues:
Class Imbalance: The model was trained on a dataset with a significant class imbalance, which can lead to a bias towards the majority class. This is evident in the confusion matrix, where predictions for the minority class (0) are substantially less frequent.
Overfitting Potential: Decision Trees are prone to overfitting, especially when they are deep and complex. This can result in high accuracy on the training data but poor generalization to new data.
Kappa Statistic: Although the Kappa statistic of 0.7786 suggests a fair agreement beyond chance, it is still important to consider the impact of class imbalance on this metric.
Mcnemar’s Test: The p-value from Mcnemar’s test indicates a statistically significant difference between the model’s classification errors for the two classes, suggesting that the model may not perform equally well across both classes.
Sensitivity and Specificity: The sensitivity of 70.05% indicates that the model has moderate ability to correctly identify true positives for the minority class. However, the high specificity of 99.98% shows that the model is very good at identifying true negatives.
# install.packages("ROSE")
library(ROSE)
library(rpart)
# Subset data with top features
top_features <- c("offer_reward", "difficulty", "duration", "bogo", "mobile", "social", "web")
train_data <- train_data[, c(top_features, "offer_success")]
test_data <- test_data[, c(top_features, "offer_success")]
# Set seed for reproducibility
set.seed(123)
# Define the model formula with top features
formula <- as.formula("offer_success ~ offer_reward + difficulty + duration + bogo + mobile + social + web")
# Apply random oversampling to the training data
oversampled_object <- ovun.sample(formula, data = train_data, method = "over", N = 2 * sum(train_data$offer_success == 1), seed = 123)
# Extract the balanced dataset from the oversampled object
train_data_balanced <- oversampled_object$data
# Train the Decision Tree model on the balanced data
model_dtree_new <- rpart(formula, data = train_data_balanced, method = "class", control = rpart.control(cp = 0.01))
# Make predictions on the test set
predictions <- predict(model_dtree_new, newdata = test_data, type = "class")
# Evaluate the model
conf_matrix <- table(predictions, test_data$offer_success)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
conf_matrix_caret <- confusionMatrix(data = as.factor(predictions), reference = as.factor(test_data$offer_success))
print(conf_matrix)##
## predictions 0 1
## 0 542 37998
## 1 29 277213
## [1] "Accuracy: 0.8796"
## [1] "\n***Decision Tree with random oversampling***\n"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 542 37998
## 1 29 277213
##
## Accuracy : 0.88
## 95% CI : (0.878, 0.881)
## No Information Rate : 0.998
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.024
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.94921
## Specificity : 0.87945
## Pos Pred Value : 0.01406
## Neg Pred Value : 0.99990
## Prevalence : 0.00181
## Detection Rate : 0.00172
## Detection Prevalence : 0.12205
## Balanced Accuracy : 0.91433
##
## 'Positive' Class : 0
##
OBSERVATION:
for the improved decisoin tree model: The enhanced Decision Tree
model was developed to address the shortcomings of the initial model,
notably the class imbalance and potential for overfitting. To this end,
we applied the ROSE package to perform oversampling, creating a balanced
representation of classes in the training data. We also refined the
model by selecting the most relevant features and incorporated
regularization techniques by setting the complexity parameter
(cp) and the minimum number of observations that must exist
in a node for a split to be attempted (minsplit).
The model’s accuracy post-improvement stands at 87.96%, lower than the initial model but more indicative of its true performance given the balanced dataset. The substantial improvement in sensitivity to 94.92% demonstrates the model’s enhanced capability in correctly identifying the minority class, which is critical in imbalanced datasets.
Furthermore, the application of pruning parameters in the decision tree has helped to mitigate overfitting by avoiding an overly complex tree that perfectly fits the training data at the expense of generalization.
In conclusion, the improved Decision Tree model represents a more accurate and reliable classifier than its predecessor. By addressing class imbalance through oversampling and incorporating regularization to combat overfitting, we have achieved a model that not only performs better on balanced data but also has a greater potential to generalize to unseen data. This has been substantiated by a more balanced accuracy, indicating improved model performance across both classes.
CONCLUSION: Based on the provided information and the AUC values from the ROC curves, the improved logistic regression model with an AUC of 0.9727 outperforms the improved Decision Tree model with an AUC of 0.9143. The higher AUC value for the logistic regression model indicates a superior ability to distinguish between the successful and unsuccessful offers. This is a significant metric in evaluating model performance, particularly in classification tasks where the objective is to maximize the correct identification of both positive and negative classes while minimizing the error rates.
Moreover, the logistic regression model displayed a balanced accuracy and an improved sensitivity over the initial model, which further suggests its enhanced capability to generalize and perform well on unseen data. In conclusion, the improved logistic regression model is deemed better in terms of the metrics used for evaluation in this context.
# Install and load necessary packages if not already installed
# install.packages(c("pROC", "caret", "ROCR"))
library(pROC)
library(caret)
library(ROCR)
# Function to plot ROC curve
plotROC <- function(predictions, labels, main = "ROC Curve") {
roc_curve <- roc(labels, as.numeric(predictions))
auc_value <- auc(roc_curve)
plot(roc_curve, main = main, col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red") # Diagonal line for reference
legend("bottomright", legend = paste("AUC =", round(auc_value, 4)), col = "blue", lwd = 2)
}
# Plot ROC curve
plotROC(predictions, test_data$offer_success, main = "ROC Curve for Decision Tree Model")
OBSERVATION:
The ROC-AUC curve presented for the improved Decision Tree model demonstrates its classification strength with an AUC of 0.9143. This value indicates a strong ability of the model to differentiate between the classes of ‘offer success’. An AUC close to 1.0 suggests excellent model performance, and at 0.9143, the model is considered very good, particularly in comparison to a model with random guessing (AUC of 0.5). This AUC value signifies that the model has a high true positive rate across various thresholds while maintaining a low false positive rate, showcasing its effectiveness in the classification task after the enhancements were made.
SMART Q2: Can we predict which customer segment a new customer is likely to belong to based on their demographics?
cluster_data <- merged_data %>%
group_by(id) %>%
summarise(
age = first(age),
income = first(income),
num_transactions = sum(transaction, na.rm = TRUE),
total_offers_received = sum(offer_received),
total_offers_viewed = sum(offer_viewed),
total_offers_completed = sum(offer_completed),
membership_duration = first(membership_duration),
gender = first(gender)
)
scaled_data <- scale(cluster_data[, c("age", "income", "num_transactions", "total_offers_received", "total_offers_viewed", "total_offers_completed", "membership_duration")])
# Determine the optimal number of clusters (k) using the elbow method
wss <- numeric(10)
for (i in 1:10) {
wss[i] <- sum(kmeans(scaled_data, centers = i)$withinss)
}
# Plot the elbow method to find the optimal k
plot(1:10, wss, type = "b", xlab = "Number of Clusters (k)", ylab = "Within Sum of Squares (WSS)")# Based on the plot, choose the optimal k (elbow point)
optimal_k <- 4 # Adjust this based on the plot
# Print the result of the elbow method
cat("Optimal number of clusters (k) based on the elbow method:", optimal_k, "\n")## Optimal number of clusters (k) based on the elbow method: 4
# Apply K-Means clustering with the optimal k
kmeans_model <- kmeans(scaled_data, centers = optimal_k)
# Add the cluster labels to the original dataset
cluster_data$cluster <- as.factor(kmeans_model$cluster)
# Analyze the characteristics of each cluster
cluster_profiles <- cluster_data %>%
group_by(cluster) %>%
summarise_all(mean)
print(cluster_profiles)## # A tibble: 4 × 10
## cluster id age income num_transactions total_offers_received
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 NA 113. 64566. 6.48 4.58
## 2 2 NA 61.6 86837. 6.59 4.83
## 3 3 NA 50.5 57117. 5.64 3.84
## 4 4 NA 49.7 52688. 14.3 4.90
## # ℹ 4 more variables: total_offers_viewed <dbl>, total_offers_completed <dbl>,
## # membership_duration <dbl>, gender <dbl>
# Based on the plot, choose the optimal k (elbow point)
optimal_k <- 4 # Adjust this based on the plot
# Apply K-Means clustering with the optimal k
kmeans_model <- kmeans(scaled_data, centers = optimal_k)
# Add the cluster labels to the original dataset
cluster_data$cluster <- as.factor(kmeans_model$cluster)
# Analyze the characteristics of each cluster
cluster_profiles <- cluster_data %>%
group_by(cluster) %>%
summarise_all(mean)
print(cluster_profiles)## # A tibble: 4 × 10
## cluster id age income num_transactions total_offers_received
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 NA 50.5 57117. 5.64 3.84
## 2 2 NA 49.7 52688. 14.3 4.90
## 3 3 NA 61.6 86837. 6.59 4.83
## 4 4 NA 113. 64566. 6.48 4.58
## # ℹ 4 more variables: total_offers_viewed <dbl>, total_offers_completed <dbl>,
## # membership_duration <dbl>, gender <dbl>
## NULL
library(caret)
library(e1071)
set.seed(123) # Set seed for reproducibility
index <- createDataPartition(cluster_data$cluster, p = 0.8, list = FALSE)
train_data <- cluster_data[index, ]
test_data <- cluster_data[-index, ]
table(cluster_data$cluster)
# Train SVM with class weights
svm_model <- svm(cluster ~ age + income + num_transactions + total_offers_received + total_offers_viewed + total_offers_completed + membership_duration + gender,
data = cluster_data,
kernel = "radial",
class.weights = table(cluster_data$cluster) / nrow(cluster_data))
# Make predictions on the test set
predictions <- predict(svm_model, newdata = test_data)
# Evaluate the model
confusion_matrix <- table(predictions, test_data$cluster)
print(confusion_matrix)
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy: ", accuracy))Observations: Effective clustering is demonstrated by the confusion matrix, which shows excellent accuracy 97% and few misclassifications across four groups.
# Assuming 'confusion_matrix' is your confusion matrix
# Extract the elements from the confusion matrix
tp <- confusion_matrix[2, 2] # True Positives
fp <- confusion_matrix[1, 2] # False Positives
# Calculate precision
precision <- tp / (tp + fp)
# Print precision
print(paste("Precision: ", precision))## [1] "Precision: 0.992814371257485"
Observations: With an excellent precision, positive prediction reliability and false positive rates are also low.
tp <- confusion_matrix[2, 2] # True Positives
fn <- confusion_matrix[2, 1] # False Negatives
tn <- confusion_matrix[1, 1] # True Negatives
fp <- confusion_matrix[1, 2] # False Positives
# Calculate sensitivity (recall)
sensitivity <- tp / (tp + fn)
# Calculate specificity
specificity <- tn / (tn + fp)
# Print sensitivity and specificity
print(paste("Sensitivity: ", sensitivity))## [1] "Sensitivity: 1"
## [1] "Specificity: 0.994579945799458"
Observations: The model shows excellent specificity and sensitivity, highlighting balanced performance in detecting real positives and negatives.
CONCLUSION
SMART Q1
To design a precise predictive model for classifying customer responses to offers as successful or not, we implemented and iteratively improved logistic regression and decision tree models. By addressing class imbalance with techniques like oversampling and applying regularization methods to reduce overfitting, we enhanced the models’ predictive accuracies and generalization capabilities. The logistic regression model, after refinement, yielded an AUC of 0.9727, indicating a high ability to differentiate between classes and showing strong model performance. Meanwhile, the improved decision tree model achieved an AUC of 0.9143, which, while very good, was less indicative of an optimal performance than the logistic regression model.
Evaluating both models’ AUC values, sensitivities, specificities, and overall accuracies, the improved logistic regression model emerged as the more precise classifier for predicting customer responses. The combination of data preprocessing, feature selection, model evaluation, and fine-tuning led to the development of a robust predictive model that meets the objective of accurately classifying customer responses to offers.
SMART Q2
- Following the analysis using the Elbow method, we separated the entire set of data into four optimal clusters and then performed K-means clustering.
- The SVM model is then trained to anticipate cluster membership. The confusion matrix shows an excellent accuracy , with precision, sensitivity, and specificity all above 99%. These are very encouraging results.
To conclude, these metrics show that the model performs well in detecting both positive and negative information, and it is very effective at classifying data into the appropriate clusters with few errors. This degree of precision and accuracy shows that the model is strong and trustworthy when it comes to generating predictions using the dataset.
REFERENCES
Nandamuri, P., & Gowthami, C. (2015). Influence of Consumer Demographics on Attitude Towards Branded Products-An Exploratory Study on Consumer Durables in Rural Markets. SSRN.
Valerevna, I. S. (2022). Brand Evolution Based on Innovation: Starbucks Coffee Company Case Study.